Most common plots
Visualisations play a key role in data analysis by helping us understand patterns, relationships, and distributions within datasets.
Let’s take a moment to review some of the most commonly used types of plots in research. We’ll begin by using functions built into base R, which are ideal for producing quick visualisations when you want to explore data efficiently and aesthetic refinement isn’t a priority.
We will use the mtcars dataset, which comes built into
R. This dataset originates from the 1974 Motor Trend US magazine and
contains fuel consumption and performance data for 32 car models from
the 1970s. It includes variables that let us explore relationships
between car design (such as weight, engine size, or number of cylinders)
and performance characteristics (like fuel efficiency or
acceleration).
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
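The preview above looks like the output of head(), although the call itself is not shown. A minimal sketch, assuming that is how it was produced, together with two other quick ways to inspect the dataset:

```r
# Assumption: the preview above was printed with head(), which shows the first six rows
head(mtcars)

# Dimensions: 32 rows (cars) and 11 columns (variables)
dim(mtcars)

# Compact overview of each variable's type and first few values
str(mtcars)
```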
Histograms
A histogram is a type of plot used to visualise the distribution of a single continuous variable. It helps reveal the overall shape of the data (for example, whether it is normally distributed or skewed) and can also make it easier to spot potential outliers.
For instance, let’s explore how the variable mpg, which
represents miles per gallon for each car in the dataset, is
distributed:
hist(mtcars$mpg,
     main = "Histogram of Miles per Gallon (MPG)",
     xlab = "Miles per Gallon", breaks = 5,
     col = "lightblue",
     border = "white",
     xlim = c(min(mtcars$mpg), 35))
This plot shows where the cars fall in terms of miles per gallon and how much variation exists across models. By changing the number of bins using the breaks argument, we can vary how much detail we get (note that R treats a single number here as a suggestion and may adjust it to produce round bin edges):
hist(mtcars$mpg,
     main = "Histogram of Miles per Gallon (MPG)",
     xlab = "Miles per Gallon", breaks = 10,
     col = "lightblue",
     border = "white",
     xlim = c(min(mtcars$mpg), 35))
Density plots
A density plot is a smoothed version of a histogram. Instead of grouping data into bins, it estimates the probability density of the variable across its range. Compared to a histogram, a density plot provides a smooth continuous curve, making it easier to see the overall shape of the distribution, and it also allows easier comparison between multiple groups (e.g., overlaying several densities). It does not depend on bin widths, which can strongly affect a histogram’s shape, although it does depend on the choice of smoothing bandwidth.
plot(density(mtcars$mpg),
     main = "Density Plot of Miles per Gallon (MPG)",
     xlab = "Miles per Gallon",
     col = "darkblue",
     lwd = 2)
Bar plots
A bar plot is a popular way to compare quantities across discrete categories, making it easy to see differences between groups. For example, if we want to examine the average fuel efficiency of cars with different numbers of cylinders, we can use a bar plot:
barplot(tapply(mtcars$mpg, mtcars$cyl, mean),
        main = "Average MPG by Number of Cylinders",
        xlab = "Cylinders",
        ylab = "Average MPG",
        col = "orange")
This code takes the variable mtcars$mpg, which we want
to summarise, and the grouping variable mtcars$cyl, and
applies the mean function to each group. In other words, it
calculates the average miles per gallon (mpg) within each
cylinder category.
The tapply() function in R performs this operation by
applying a specified function (here, mean) to subsets of a
vector (mtcars$mpg) defined by a categorical variable
(mtcars$cyl). It returns a named vector, where the
names correspond to the cylinder categories (4, 6, and 8) and the values
represent the average mpg for each group.
Finally, the barplot() function takes this named vector
and creates a bar for each cylinder group, using the category names as
labels and the mean mpg values as the bar heights.
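To make the intermediate step visible, the summary can be computed and printed on its own before it is passed to barplot(). This is only an illustrative sketch; the aggregate() call is an alternative we add for comparison, not something the code above relies on.

```r
# Mean mpg within each cylinder group, returned with the group labels as names
avg_mpg <- tapply(mtcars$mpg, mtcars$cyl, mean)
avg_mpg

# The names are the cylinder categories that barplot() uses as bar labels
names(avg_mpg)

# The same summary via aggregate(), which returns a data frame instead
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
```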
One important point to note is that while the bar plot shows the averages, it does not display the variation within each category. Means can be heavily influenced by outliers, so it is generally not advisable to show only the mean values. You should always include standard errors or some measure of variation, so that readers can see the uncertainty associated with the averages:
# function to calculate standard error
se <- function(x) sd(x) / sqrt(length(x))
# calculate mean MPG and standard errors by cylinder
avg_mpg <- tapply(mtcars$mpg, mtcars$cyl, mean)
se_mpg <- tapply(mtcars$mpg, mtcars$cyl, se)
# create bar plot
bp <- barplot(avg_mpg,
              main = "Average MPG by Cylinder with Standard Errors",
              xlab = "Cylinders",
              ylab = "Average MPG",
              col = "orange",
              ylim = c(0, max(avg_mpg + 2 * se_mpg) + 2)) # extend y-axis for error bars
# add error bars (mean ± 2 SE)
arrows(x0 = bp, y0 = avg_mpg - 2 * se_mpg,
       x1 = bp, y1 = avg_mpg + 2 * se_mpg,
       angle = 90, code = 3, length = 0.1, col = "black")
Stacked / proportional bar plots
If you are comparing parts of a whole across different categories, it can be useful to use a stacked or proportional bar plot. Stacked bar plots show how subgroups contribute to totals, while proportional stacked bar plots standardise those totals so you can compare relative proportions even when overall group sizes differ.
For example, we can use a stacked bar plot to compare the number of automatic versus manual transmissions across different numbers of cylinders. We start by creating a table with frequencies for the relevant levels of each variable:
##     
##       0  1
##   4   3  8
##   6   4  3
##   8  12  2
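The call that produced this table is not shown, but the plotting code that follows expects an object named tbl. A minimal sketch, assuming the table was built with table() on the cyl and am columns (which matches the printed layout):

```r
# Assumption: cross-tabulate cylinders (rows) by transmission type
# (columns; in mtcars, am = 0 is automatic and am = 1 is manual)
tbl <- table(mtcars$cyl, mtcars$am)
tbl
```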
We can then plot the data from this table as follows:
barplot(
  t(tbl), # need to transpose to show n cylinders on x-axis
  beside = FALSE, # stacked (not grouped)
  main = "Transmission type by number of cylinders",
  xlab = "Number of cylinders",
  ylab = "Count",
  col = c("skyblue", "orange"),
  legend.text = c("Automatic", "Manual"),
  names.arg = rownames(tbl)) # adds 4, 6, 8 as x-axis labels
The legend is overlapping the plot, so let’s move it outside the plotting area:
# increase right margin so there is room for the legend
par(mar = c(5, 4, 4, 9)) # bottom, left, top, right
# Stacked bar plot
barplot(
  t(tbl),
  beside = FALSE,
  main = "Transmission type by number of cylinders",
  xlab = "Number of cylinders",
  ylab = "Count",
  col = c("skyblue", "orange"),
  names.arg = rownames(tbl),
  legend.text = c("Automatic", "Manual"),
  args.legend = list(x = "topright", inset = c(-0.3, 0), xpd = TRUE)) # moves legend outside
If you want to look at proportions instead of raw counts, you can use
a proportional bar plot, which normalises each bar to 100% height and
shows relative proportions rather than counts. To do this, we first
convert the counts in the frequency table to proportions using the
prop.table() function. How the proportions are calculated
is controlled by the margin = argument: if it is set to 1,
each row will sum to 100%:
par(mar = c(5, 4, 4, 9))
# Create proportional table
prop_tbl <- prop.table(tbl, margin = 1)
# Proportional stacked bar plot
barplot(
  t(prop_tbl), # transpose to have cylinders on x-axis
  beside = FALSE, # stacked bars
  main = "Transmission type by number of cylinders",
  xlab = "Number of cylinders",
  ylab = "Proportion",
  col = c("skyblue", "orange"),
  names.arg = rownames(tbl),
  legend.text = c("Automatic", "Manual"),
  args.legend = list(x = "topright", inset = c(-0.3, 0), xpd = TRUE))
These plots can be useful, but one disadvantage is that only the bottom segment shares a common baseline, making the middle and top segments harder to compare across categories. I also wouldn’t recommend using this type of plot with more than two, or at most three, categories, as it quickly becomes visually busy and difficult to interpret. A better alternative might be faceted plots, which we’ll look at later.
Box plots
The bar plot with standard errors is certainly an improvement, but we
can provide even more information about the distribution of
mpg across cars with different numbers of cylinders. For
instance, it is often useful to see the medians,
quartiles, and potential outliers in the data. A good
way to visualise all of this at once is with a box plot.
boxplot(mpg ~ as.factor(cyl),
        data = mtcars,
        main = "Box plot of MPG by Number of Cylinders",
        xlab = "Cylinders",
        ylab = "Miles per Gallon",
        col = c("lightgreen", "lightblue", "lightpink"))
In this graph,
the thick line inside each box shows the median mpg for that cylinder group;
the heights of the boxes show the middle 50% of the data (called the interquartile range), which gives a sense of variability (here, we see that there is much more variability across cars with 4 cylinders than across cars with a higher number of cylinders);
the lines extending from the boxes are called whiskers and reach to the smallest and largest values that lie within 1.5 times the interquartile range of the box edges;
hollow circles with black outlines indicate points that fall beyond the whiskers (more than 1.5 times the interquartile range from the box) and are often referred to as potential outliers.
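The numbers behind a single box can be inspected with boxplot.stats(), the base R helper that computes these quantities. A small sketch for the 8-cylinder group (the object names mpg8 and stats8 are ours, chosen for illustration):

```r
# mpg values for the 8-cylinder cars
mpg8 <- mtcars$mpg[mtcars$cyl == 8]

# boxplot.stats() returns the numbers behind one box:
# $stats = lower whisker end, lower hinge, median, upper hinge, upper whisker end
# $out   = points drawn individually as potential outliers
stats8 <- boxplot.stats(mpg8)
stats8$stats
stats8$out
```

Comparing stats8$out with the whisker ends in stats8$stats shows which cars are drawn as separate circles in the plot.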
Overall, the box plot reveals that cars with 4 cylinders generally
have higher mpg, but there is some overlap with 6-cylinder
cars, and a few 8-cylinder cars have unusually low mpg
values. We also see that fuel efficiency varies much more in 4-cylinder
cars than in cars with more cylinders. This level of detail is not
visible in a simple bar plot.
Note that a box plot can also be used to examine a single continuous variable; it is not limited to comparing two variables:
boxplot(mtcars$mpg,
        main = "Box plot of Miles per Gallon (MPG)",
        ylab = "Miles per Gallon",
        col = "lightblue")
Box-and-whiskers plots are certainly an improvement over simple bar plots. However, while we can see summary statistics such as medians and quartiles, we still don’t get a full picture of the data distribution. If we could combine the box plot with a density plot, we would have a much richer visualisation. Fortunately, there is a type of plot that does exactly that - the raincloud plot. Base R doesn’t include a built-in function for these plots, but we will revisit this later when we explore specialised plotting packages in R.
Scatter plots
So far, we have explored how to examine the summary statistics and distribution of a single continuous variable using histograms, density plots, and box plots. We also saw how to investigate the relationship between a continuous variable and a categorical variable with bar plots and box plots. But how can we visualise the relationship between two continuous variables? The most commonly used plot for this purpose is the scatter plot.
plot(mtcars$wt, mtcars$mpg,
     main = "Scatter plot of Weight vs. MPG",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per Gallon",
     pch = 19,
     col = "darkgreen")
The plot above allows us to examine how a car’s weight
(wt) relates to its miles per gallon (mpg) in
the mtcars dataset. Each point represents an individual
car, and we can observe a negative relationship: heavier cars tend to
have lower fuel efficiency.
Scatter plots like this are particularly useful for exploring associations or correlations between two continuous variables. They help identify trends, clusters, or unusual observations in the data and are often used in conjunction with correlation and regression analyses.
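As a brief illustration of how a scatter plot pairs with correlation and regression, the sketch below computes the Pearson correlation and overlays a fitted straight line. The object name fit is ours, and a simple linear model is only one possible choice here.

```r
# Pearson correlation between weight and fuel efficiency (negative for these data)
cor(mtcars$wt, mtcars$mpg)

# Simple linear regression of mpg on weight
fit <- lm(mpg ~ wt, data = mtcars)

# Scatter plot with the fitted regression line drawn on top
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per Gallon",
     pch = 19,
     col = "darkgreen")
abline(fit, lwd = 2)
```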
Line plots
Another type of plot that is very common and useful in research when working with two continuous variables is a line plot. As the name suggests, this plot is a line! It is typically used to show how a continuous variable changes over an ordered sequence, such as time, an index, or any naturally ordered variable. Line plots are particularly helpful for identifying trends, patterns, or fluctuations over that sequence.
Line plots are similar to scatter plots in that they depict a relationship between two variables, but here the focus is on the change within one variable across an ordered set of observations, which is represented by the other variable.
We do not really have variables that change over time in the
mtcars dataset, but we can use other built-in datasets to
illustrate this. For example, the AirPassengers dataset can
be used to show monthly totals of international airline passengers from
1949 to 1960:
plot(AirPassengers,
     type = "l",
     main = "Monthly Airline Passengers (1949–1960)",
     xlab = "Year",
     ylab = "Number of Passengers",
     col = "blue",
     lwd = 2)
It shows clearly that, while there are fluctuations within each year, the overall number of passengers steadily increases across the years.
Similarly, we can use the pressure dataset to examine
how the vapor pressure of mercury changes as a function of
temperature:
plot(pressure$temperature, pressure$pressure,
     type = "l",
     main = "Vapor pressure of mercury vs. Temperature",
     xlab = "Temperature (°C)",
     ylab = "Pressure (mm Hg)",
     col = "red",
     lwd = 2)
We can see that vapor pressure rises systematically, and increasingly steeply, with the ordered variable temperature.
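The two calls above look different because the two datasets are stored differently. The sketch below shows this; sorting the pressure data first is a precaution we add for illustration, since a line plot connects points in row order.

```r
# AirPassengers is a time-series (ts) object: the values carry their own time index
class(AirPassengers)
frequency(AirPassengers)   # observations per year
start(AirPassengers)       # time of the first observation

# pressure is an ordinary data frame, so x and y are supplied explicitly;
# sorting by x first guards against zig-zag lines if rows were out of order
p <- pressure[order(pressure$temperature), ]
plot(p$temperature, p$pressure,
     type = "l",
     xlab = "Temperature (°C)",
     ylab = "Pressure (mm Hg)")
```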
Interim summary
We’ve now looked at some of the most common plots using functions built into R. These functions are useful because the plots are quick to create and require very little information. If all you need is a fast visual, they work perfectly. In fact, even without specifying any attributes, you can still get a sense of the data:
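The examples that originally accompanied this point are not shown here. A minimal sketch of what such bare calls might look like, assuming the same mtcars variables used earlier and leaving every optional argument at its default:

```r
# Same data as before, with no titles, labels, or colours specified
hist(mtcars$mpg)
boxplot(mpg ~ cyl, data = mtcars)
plot(mtcars$wt, mtcars$mpg)
```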
However, these base R plots are not particularly pretty by default, and it can be cumbersome to add layers or extra information. More complex visuals, such as richly annotated heatmaps or Sankey-style diagrams, are also awkward to build with base functions alone. For this reason, many researchers prefer other tools when preparing figures for scientific publications.
Instead, most people use ggplot2, a visualisation
package in R that is both powerful and flexible. Since its introduction,
it has become the gold standard for data visualisation
in R. One of its greatest advantages is that it was designed to work
seamlessly with tidyr and dplyr, providing
full integration that makes data cleaning and visualisation workflows
straightforward.
To load ggplot2, you can either use
library("ggplot2"), or you can load the
tidyverse package, which includes ggplot2
along with tidyr, dplyr, and other useful
packages:
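The loading code itself is not shown above. A minimal sketch, assuming the tidyverse package has already been installed:

```r
# install.packages("tidyverse")  # run once if the package is not installed yet
library("tidyverse")             # attaches ggplot2, dplyr, tidyr and other core packages
```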
Plotting with ggplot2
In ggplot2, plots are built up using a layered
approach, where each layer adds a different component to the
visualisation.
The data layer specifies the dataset you are working with;
The aesthetics layer (aes) maps variables to visual properties, such as x- and y-axes, colour, size, or shape;
Geometric objects (geoms) define the type of plot you want, such as points, lines, bars, or boxplots;
Facets allow you to create multiple subplots based on a categorical variable, making it easy to compare groups side by side;
The statistics layer can add summary calculations, such as means, medians, or regression lines;
The coordinates layer controls the coordinate system and axes;
Finally, the theme layer is used to adjust the overall appearance of the plot, including fonts, colours, grid lines, and spacing.
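To give a feel for how these layers combine, here is a compact sketch in which each line corresponds to one layer from the list above. The particular choices (weight against mpg, a linear trend line, facets by cylinder, the axis limits) are ours and purely illustrative.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +                   # data + aesthetics
  geom_point() +                                              # geometric object
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +   # statistics: fitted line
  facet_wrap(~ cyl) +                                         # facets: one panel per cylinder count
  coord_cartesian(ylim = c(10, 35)) +                         # coordinates: visible y-range
  theme_classic()                                             # theme
p
```

Removing any one line (other than the first) still gives a valid plot, which is what makes the layered approach convenient for building figures step by step.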
We will now see how this works in practice by recreating the figures
we previously plotted using base R functions, but this time using
ggplot2.
Histograms
# data and aesthetics layer
ggplot(mtcars, aes(x = mpg)) +
  # geometric object layer
  geom_histogram()
Let’s break down what’s happening here step by step:
The first line of code, ggplot(mtcars, aes(x = mpg)), tells ggplot2 which dataset to use (mtcars). This represents the first layer - the data layer.
The aes() function defines the aesthetics layer, which maps the variable mpg to the x-axis. At this stage, no plot is drawn yet; we are simply specifying the data and how it will be represented.
The next step is to choose the type of plot, which is done using the geometric object layer (geom). ggplot2 provides many geoms, but here we choose geom_histogram() to draw a histogram.
At this point, the histogram is drawn, but it doesn’t yet match the appearance of the base R example. To adjust its look, we add arguments inside the geom:
bins = 5 sets the number of bars in the histogram;
fill = "lightblue" specifies the colour inside the bars;
color = "white" sets the border colour of the bars.
These arguments allow us to customise the basic appearance of the histogram while keeping the underlying data and mapping unchanged:
# data and aesthetics layer
ggplot(mtcars, aes(x = mpg)) +
  # geometric object layer
  geom_histogram(bins = 5,
                 fill = "lightblue",
                 color = "white")
We continue building the plot using the labs() function,
which adds a title and labels for the x- and y-axes. This helps make the
plot easier to read and understand:
# data and aesthetics layer
ggplot(mtcars, aes(x = mpg)) +
  # geometric object layer
  geom_histogram(bins = 5,
                 fill = "lightblue",
                 color = "white") +
  # labels
  labs(title = "Histogram of Miles per Gallon (MPG)",
       x = "Miles per Gallon",
       y = "Count")
Finally, the default grey background and grid don’t look nice and add
visual clutter. We can remove these using the theme layer:
theme_classic() provides a simple, clean look with a white
background and no gridlines.
# data and aesthetics layer
ggplot(mtcars, aes(x = mpg)) +
  # geometric object layer
  geom_histogram(bins = 5,
                 fill = "lightblue",
                 color = "white") +
  # labels
  labs(title = "Histogram of Miles per Gallon (MPG)",
       x = "Miles per Gallon",
       y = "Count") +
  # theme
  theme_classic()
You may be wondering how to know which options to use to make specific changes or what kinds of customisations are available. There are many excellent resources that explain what ggplot2 can do and how to take full advantage of its flexibility (see final section of this file for links). As a starting point, you might explore the official documentation and the extensive collection of examples at https://r-graph-gallery.com/ggplot2-package.html.
Density plots
Let’s now recreate the density plot we made earlier. The data and
aesthetics remain unchanged because we are still working with the
mpg variable. The theme also stays the same, as we want the
clean white background provided by theme_classic(). The
geometric object layer is now geom_density() instead of
geom_histogram(), which creates a smooth curve representing
the distribution of mpg. Finally, the labels are updated
using labs() to reflect that this is a density plot rather
than a histogram:
ggplot(mtcars, aes(x = mpg)) +
  geom_density() +
  labs(title = "Density plot of Miles per Gallon (MPG)", x = "Miles per Gallon", y = "Density") +
  theme_classic()
Like with geom_histogram(), to change the appearance of the curve, we can use arguments like fill, color, and adjust:
fill = "lightblue" sets the colour under the density curve;
color = "white" sets the colour of the curve’s outline;
adjust = 1.5 controls the smoothness of the curve (larger values make it smoother).
ggplot(mtcars, aes(x = mpg)) +
  geom_density(fill = "lightblue", color = "white", adjust = 1.5) +
  labs(title = "Density plot of Miles per Gallon (MPG)", x = "Miles per Gallon", y = "Density") +
  theme_classic()
Box plots with one variable
Like before, we only need to make a few small changes to the plot.
The geom is now geom_boxplot(), and we can
specify the fill and color arguments to adjust
its appearance. We also update the title and axis labels to reflect that
we are now creating a box plot. The main change, however, is that we map
the variable mpg to the y-axis instead of the x-axis, so that the box is
drawn vertically, as in the base R version (mapping mpg to x would draw a
horizontal box instead). As a result, the axis labels change accordingly.
ggplot(mtcars, aes(y = mpg)) +
  geom_boxplot(fill = "lightblue",
               color = "black") +
  labs(title = "Box plot of Miles per Gallon (MPG)",
       y = "Miles per Gallon") +
  theme_classic()
Raincloud plots with one variable
Recall that we mentioned earlier that a raincloud plot combines features of a box plot and a density plot. Like a box plot, it shows the median and interquartile range, but it also displays the full distribution of the data through the shape of a rotated density plot. This allows us to see both the summary statistics and the underlying distribution in one figure.
Also remember that base R does not include a built-in function for
creating raincloud plots — but now that we are using
ggplot2, we can easily produce one with just a few lines of
code.
The first step is to produce the basic shape, and we can do this
using the geom_violin geom:
ggplot(mtcars, aes(x = "", y = mpg)) +
  geom_violin(fill = "lightblue",
              color = "black") +
  labs(title = "Violin plot of Miles per Gallon (MPG)",
       y = "Miles per Gallon", x = NULL) +
  theme_classic()
geom_violin() requires both x and
y aesthetics. However, since we are only plotting one
variable, we set x = "" (an empty string) to tell
ggplot2 to treat all the data as belonging to a single
category. geom_violin() then uses this “dummy” x variable
to create a single violin representing the distribution of
mpg. Setting x = NULL inside
labs() removes the unnecessary x-axis label.
Let’s now add a narrow box plot to show the median and interquartile range:
ggplot(mtcars, aes(x = "", y = mpg)) +
  geom_violin(fill = "lightblue",
              color = "black") +
  geom_boxplot(width = 0.1, fill = "blue",
               outlier.colour = "blue", outlier.shape = 16, outlier.size = 2) +
  labs(title = "Violin plot of Miles per Gallon (MPG)",
       y = "Miles per Gallon", x = NULL) +
  theme_classic()
Now add the individual data points:
ggplot(mtcars, aes(x = "", y = mpg)) +
  geom_violin(fill = "lightblue",
              color = "black") +
  geom_boxplot(width = 0.1, fill = "blue",
               outlier.colour = "blue", outlier.shape = 16, outlier.size = 2) +
  geom_jitter(width = 0.05, alpha = 0.6, color = "black") +
  labs(title = "Violin plot of Miles per Gallon (MPG)",
       y = "Miles per Gallon", x = NULL) +
  theme_classic()
It might also be useful to flip this plot so that the distribution lies horizontally, which can make it more visually appealing. To do this, we use coord_flip():
ggplot(mtcars, aes(x = "", y = mpg)) +
  geom_violin(fill = "lightblue",
              color = "black") +
  geom_boxplot(width = 0.1, fill = "blue",
               outlier.colour = "blue", outlier.shape = 16, outlier.size = 2) +
  geom_jitter(width = 0.05, alpha = 0.6, color = "black") +
  labs(title = "Violin plot of Miles per Gallon (MPG)",
       y = "Miles per Gallon", x = NULL) +
  theme_classic() +
  coord_flip() # flips the axes
See the little tick mark? Let’s remove it. To do this, we need to change settings in the theme() area:
ggplot(mtcars, aes(x = "", y = mpg)) +
  geom_violin(fill = "lightblue",
              color = "black") +
  geom_boxplot(width = 0.1, fill = "blue",
               outlier.colour = "blue", outlier.shape = 16, outlier.size = 2) +
  geom_jitter(width = 0.05, alpha = 0.6, color = "black") +
  labs(title = "Violin plot of Miles per Gallon (MPG)",
       y = "Miles per Gallon", x = NULL) +
  theme_classic() + coord_flip() +
  theme(
    axis.ticks.y = element_blank()) # remove the tick on the vertical axis (the x aesthetic after the flip)
This is also where we can change things like the size and face of axis labels:
ggplot(mtcars, aes(x = "", y = mpg)) +
  geom_violin(fill = "lightblue",
              color = "black") +
  geom_boxplot(width = 0.1, fill = "blue",
               outlier.colour = "blue", outlier.shape = 16, outlier.size = 2) +
  geom_jitter(width = 0.05, alpha = 0.6, color = "black") +
  labs(title = "Violin plot of Miles per Gallon (MPG)",
       y = "Miles per Gallon", x = NULL) +
  theme_classic() + coord_flip() +
  theme(
    axis.ticks.y = element_blank(),
    axis.text.x = element_text(size = 12), # tick labels on the horizontal axis (mpg after the flip)
    axis.title.x = element_text(size = 14)) # title of the horizontal axis (mpg after the flip)
Looks nice, doesn’t it? We can make the plot even cleaner by reducing
it to a half violin, which shows the same information but takes up less
space and makes it easier to compare multiple groups side by side (which
will be relevant later). To do this, we’ll use the ggdist
package, which provides the stat_halfeye() function
designed specifically for half violins and other advanced distribution
visualisations:
library("ggdist")
# specify dataset and map mpg to y-axis (x = 1 is a placeholder since we have one group)
ggplot(mtcars, aes(x = 1, y = mpg)) +
  # add a half-violin shape to show the distribution of mpg
  stat_halfeye(
    adjust = 0.5, # controls smoothing of the density curve
    justification = -0.3, # moves the half violin horizontally
    fill = "lightblue", # fill colour for the violin
    # .width = 0, # uncomment to remove the interval bars
    # point_colour = NA, # uncomment to remove the point estimate
    position = position_nudge(x = -0.15)) + # shifts the half violin along the group axis
  # add a boxplot showing median and interquartile range
  geom_boxplot(
    width = 0.1, # makes the boxplot narrower
    fill = "lightblue", # match violin colour
    outlier.colour = "black", # colour for outlier points
    outlier.shape = 16, # solid round outlier points
    outlier.size = 2, # size of outlier points
    position = position_nudge(x = -0.30)) + # shifts the boxplot along the group axis (vertically after the flip)
  # add individual data points for more detail
  geom_jitter(
    width = 0.05, # spread of points along the group axis
    alpha = 0.6, # transparency level
    color = "blue") + # point colour
  # add plot title and axis labels
  labs(title = "Raincloud plot of Miles per Gallon (MPG)",
       y = "Miles per Gallon",
       x = NULL) +
  # flip coordinates so the violin lies horizontally
  coord_flip() +
  # use a clean theme
  theme_classic() +
  # remove y-axis ticks and labels (not needed for single group)
  theme(axis.ticks.y = element_blank(),
        axis.text.y = element_blank())
So much is happening here, but this plot gives us a lot of information! We can clearly see the distribution of the data through the half-violin, a point estimate with interval bars (by default, stat_halfeye() draws the median with 66% and 95% quantile intervals rather than a mean with a confidence interval), and the boxplot showing the median and interquartile range. On top of that, we also have the individual data points (spread out with a bit of random noise using jitter so that overlapping points are easier to see). This combination gives us a rich and detailed view of the data: both the overall pattern and the individual observations.
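To make the point-and-interval summary concrete, the corresponding numbers can be computed in base R. This sketch assumes stat_halfeye() is left at its defaults, which, as far as we know, are the median with central 66% and 95% quantile intervals; if you change those settings, the numbers below no longer match the plot.

```r
# Assumed default point estimate: the median
median(mtcars$mpg)

# Assumed default intervals: central 66% and 95% quantile intervals
quantile(mtcars$mpg, probs = c(0.17, 0.83))    # 66% interval
quantile(mtcars$mpg, probs = c(0.025, 0.975))  # 95% interval

# For comparison, the mean is a different summary and sits slightly higher here
mean(mtcars$mpg)
```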
Let’s now see how we can create plots with ggplot2 that
show the relationship between two (or more) variables. We’ll
look at bar plots, scatter plots, and line plots, as well as how to use
box plots and raincloud plots for visualising differences across
groups.
Bar plots
Recall that we said earlier that bar plots are much more useful when we also plot variation, so the first step is to compute standard errors. One way to do this was shown above; here we look at two other ways. We can keep using the se function we defined above in combination with some dplyr functions:
# Summarise data: mean and standard error of mpg by cylinder
mpg_summary <- mtcars %>%
  dplyr::group_by(cyl) %>%
  dplyr::summarise(
    mean_mpg = mean(mpg),
    se_mpg = se(mpg))
This code creates a new data frame with the summary statistics - very useful as we can draw data from here when we plot:
## # A tibble: 3 x 3
##     cyl mean_mpg se_mpg
##   <dbl>    <dbl>  <dbl>
## 1     4     26.7  1.36 
## 2     6     19.7  0.549
## 3     8     15.1  0.684
Alternatively, we can use the function summarySE from
the Rmisc package, which does the same in one line, and
does not require defining a function to compute SEs:
library("Rmisc")
mpg_summary <- summarySE(mtcars, measurevar = "mpg", groupvars = "cyl")
mpg_summary
##   cyl  N      mpg       sd        se       ci
## 1   4 11 26.66364 4.509828 1.3597642 3.029743
## 2   6  7 19.74286 1.453567 0.5493967 1.344325
## 3   8 14 15.10000 2.560048 0.6842016 1.478128
Exercise: look up the summarySEwithin() function from the same package (see the package documentation) and consider in which experimental designs it would be more appropriate to use than summarySE().
Ok, now we are ready to plot!
# Start a ggplot object using mpg_summary dataset
# by mapping x-axis to cylinders, y-axis to mean mpg
ggplot(mpg_summary, aes(x = factor(cyl), y = mpg)) +
  # define fill colour and border colour of the bars
  geom_col(fill = "orange", color = "black") +
  # add error bars (mean ± 2 SE)
  geom_errorbar(
    aes(ymin = mpg - 2*se, ymax = mpg + 2*se),
    width = 0.2, color = "black") +
  labs(title = "Average MPG by Cylinder",
       x = "Cylinders",
       y = "Average MPG") +
  theme_classic()
Sometimes, people like to use different colours for the levels of a factor variable, like so:
ggplot(mpg_summary, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_col(color = "black") +
  geom_errorbar(
    aes(ymin = mpg - 2*se, ymax = mpg + 2*se),
    width = 0.2, color = "black") +
  labs(title = "Average MPG by Cylinder",
       x = "Cylinders",
       y = "Average MPG") +
  theme_classic()
People often leave their plots looking like this. I can see three problems here - can you spot any?
The first issue is that the legend title for the cylinder variable isn’t very descriptive. We can fix that by giving it a clearer label, like this:
ggplot(mpg_summary, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_col(color = "black") +
  geom_errorbar(
    aes(ymin = mpg - 2*se, ymax = mpg + 2*se),
    width = 0.2, color = "black") +
  labs(title = "Average MPG by Cylinder",
       x = "Cylinders",
       y = "Average MPG",
       fill = "Number of cylinders") + # set legend for fill
  theme_classic()
However, in this particular case, we don’t actually need a legend. The x-axis already clearly tells us which bar corresponds to which cylinder type, so the legend just adds unnecessary visual clutter. That’s our second issue. So let’s remove this legend:
ggplot(mpg_summary, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_col(color = "black") +
  geom_errorbar(
    aes(ymin = mpg - 2*se, ymax = mpg + 2*se),
    width = 0.2, color = "black") +
  labs(title = "Average MPG by Cylinder",
       x = "Cylinders",
       y = "Average MPG") +
  theme_classic() +
  theme(legend.position = "none") # removes fill legend
But the main problem remains: the use of colour here doesn’t actually add any new information. Each bar already represents a different number of cylinders, so using different colours for each one is redundant. In fact, it can be confusing - the viewer might start wondering what the colours are supposed to mean.
A cleaner approach is to use a single colour for all bars. For example, we can make all bars orange, or simply use black outlines for a minimalist look:
ggplot(mpg_summary, aes(x = factor(cyl), y = mpg)) +
  geom_col(color = "black") +
  geom_errorbar(
    aes(ymin = mpg - 2*se, ymax = mpg + 2*se),
    width = 0.2, color = "black") +
  labs(title = "Average MPG by Cylinder",
       x = "Cylinders",
       y = "Average MPG") +
  theme_classic()
This illustrates an important principle of data visualisation: every design choice should serve a purpose. Colour, in particular, should be used meaningfully - not just for decoration. Aim to make your plots as minimal as possible while remaining as informative as possible.
Stacked / proportional bar plots
As mentioned earlier, these plots are not always the most helpful,
but let’s still see how we can recreate them with ggplot2
as an exercise.
Last time, we computed a frequency table using the
table() function. Now, we’ll see how to do the same thing
using tidyverse functions:
mtcars2 <- mtcars %>%
  mutate(cyl = factor(cyl), # converts cyl to factor
         am = factor(am, labels = c("Automatic", "Manual"))) %>% # assigns descriptive labels to am
  dplyr::count(cyl, am) # counts for each combination
mtcars2
##   cyl        am  n
## 1   4 Automatic  3
## 2   4    Manual  8
## 3   6 Automatic  4
## 4   6    Manual  3
## 5   8 Automatic 12
## 6   8    Manual  2
Now we are ready to plot:
ggplot(mtcars2, aes(x = cyl, y = n, fill = am)) +
  geom_bar(stat = "identity") +
  labs(title = "Transmission type by number of cylinders",
       x = "Number of cylinders",
       y = "Count",
       fill = "Transmission") +
  theme_classic()
An important point to note is that here we are using geom_bar() rather than geom_col(), which we used earlier together with geom_errorbar(). In ggplot2, geom_bar() by default counts the number of observations in each category. However, since we have already calculated these counts ourselves, we set stat = "identity" so that geom_bar() uses the pre-calculated values in the n column. This is equivalent to using geom_col().
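The relationship between the two geoms can be checked directly. The sketch below builds its own small table of counts (the object counts and its Freq column come from as.data.frame(table(...)) and are ours, not part of the code above) and compares the bar heights that each geom computes.

```r
library(ggplot2)

# Pre-computed counts for each cylinder/transmission combination (stored in column Freq)
counts <- as.data.frame(table(cyl = mtcars$cyl, am = mtcars$am))

# Two ways to draw bars whose heights are already in the data
p_bar <- ggplot(counts, aes(x = cyl, y = Freq, fill = am)) +
  geom_bar(stat = "identity")
p_col <- ggplot(counts, aes(x = cyl, y = Freq, fill = am)) +
  geom_col()

# Compare the computed bar tops of the two layers
all.equal(layer_data(p_bar)$ymax, layer_data(p_col)$ymax)
```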
If we wanted geom_bar() to count the observations
automatically, we would simply use it without
stat = "identity", like this:
# no counting is done here
mtcars3 <- mtcars %>%
mutate(cyl = factor(cyl),
am = factor(am, labels = c("Automatic", "Manual")))
ggplot(mtcars3, aes(x = cyl, fill = am)) + # y-axis not specified
geom_bar() + # ggplot counts the number of rows for each group automatically
labs(title = "Transmission type by number of cylinders",
x = "Number of cylinders",
y = "Count",
fill = "Transmission") +
theme_classic()
Exercise: use scale_fill_manual() to set the colours to
"skyblue" and "orange".
To create a proportional stacked bar plot, we need to make a slight
change in how we compute mtcars2, because now we want to
work with proportions:
mtcars2 <- mtcars %>%
mutate(cyl = factor(cyl),
am = factor(am, labels = c("Automatic", "Manual"))) %>%
dplyr::count(cyl, am) %>%
dplyr::group_by(cyl) %>% # separately for each category of cyl
dplyr::mutate(prop = n / sum(n)) # calculate proportions within each category (dplyr::mutate() respects the grouping)
mtcars2
## # A tibble: 6 x 4
## # Groups: cyl [3]
## cyl am n prop
## <fct> <fct> <int> <dbl>
## 1 4 Automatic 3 0.273
## 2 4 Manual 8 0.727
## 3 6 Automatic 4 0.571
## 4 6 Manual 3 0.429
## 5 8 Automatic 12 0.857
## 6 8 Manual 2 0.143
The last two lines here do the trick. We first tell R to treat each
cylinder category (4, 6, 8) as a separate group using the
group_by() function, so that all follow-up calculations are
done within each group rather than across the whole dataset. We then
divide the count for each transmission type n by the total
count of cars in that cylinder group sum(n) and store the
result in a new column prop. This converts the raw counts
into proportions.
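A quick way to confirm that the grouping worked is to check that the proportions add up to 1 within each cylinder group. The sketch below uses illustrative object names (props, check) and calls dplyr::mutate() explicitly: if another package that defines mutate() (plyr, for example) happens to be loaded after dplyr, a bare mutate() can ignore the grouping and divide by the total number of cars instead.

```r
library(dplyr)

# Proportions of each transmission type within each cylinder group
props <- mtcars %>%
  dplyr::count(cyl, am) %>%
  dplyr::group_by(cyl) %>%
  dplyr::mutate(prop = n / sum(n)) %>%
  dplyr::ungroup()

# Within each cylinder group the proportions should sum to 1
check <- props %>%
  dplyr::group_by(cyl) %>%
  dplyr::summarise(total = sum(prop))
check
```

If the totals are smaller than 1, the proportions were most likely computed across the whole dataset rather than within groups.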
And now let’s plot:
ggplot(mtcars2, aes(x = cyl, y = prop, fill = am)) +
geom_bar(stat = "identity") +
labs(title = "Transmission type by number of cylinders",
x = "Number of cylinders",
y = "Proportion",
fill = "Transmission") +
theme_classic()
Note that, here again, we specify stat = "identity"
because we don’t want geom_bar() to automatically compute
raw counts as we are interested in proportions. However, we could
achieve the same result by setting the position argument of
geom_bar() to "fill", like this:
ggplot(mtcars3, aes(x = cyl, fill = am)) +
geom_bar(position = "fill") +
labs(title = "Transmission type by number of cylinders",
x = "Number of cylinders",
y = "Proportion",
fill = "Transmission") +
theme_classic()
The position argument in geom_bar() (and
other geoms) controls how multiple bars or segments are arranged
relative to each other. If we set it to "fill", the bars
are scaled to the same height (100%), which shows proportions rather
than raw counts.
Exercise: try setting the position argument
to "dodge": can you figure out what "dodge"
does? We’ve already seen another use of position when we
worked with boxplots and stat_halfeye; how does
"dodge" compare to that?
Scatter plots
Let’s now recreate the scatter plot of weight and miles per gallon
that we made earlier. To create a plot with points, we use
geom_point(). Everything else follows the same logic as
before:
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point(color = "darkgreen", size = 3) +
labs(title = "Scatter plot of Weight vs. MPG",
x = "Weight (1000 lbs)",
y = "Miles per Gallon") +
theme_classic()
One additional thing worth demonstrating here is how to adjust the font size of the axis titles and tick labels to make the plot easier to read:
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point(color = "darkgreen", size = 3) +
labs(title = "Scatter Plot of Weight vs. MPG",
x = "Weight (1000 lbs)",
y = "Miles per Gallon") +
theme_classic() +
theme(
axis.title = element_text(size = 14),
axis.text = element_text(size = 12))
One thing that we often want to do is show how a relationship between two variables is affected by another variable - this is what we are interested in when we run models with an interaction effect. For example, this scatter plot seems to suggest that heavier cars tend to have lower miles per gallon, but how does this interact with the number of cylinders? We can bring in this extra variable using colour:
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
geom_point(size = 3) +
labs(title = "Scatter plot of Weight vs. MPG by number of cylinders",
x = "Weight (1000 lbs)",
y = "Miles per Gallon",
color = "Number of cylinders") +
theme_classic()
Note that in this case we do need the colour legend: without it, a reader looking at differently coloured dots would have no idea what the colours represent.
What we also see now is that the most efficient cars tend to be the lightest ones with the fewest cylinders, while the least efficient cars are the heaviest and have the most cylinders. Neat!
Another way to bring in a third variable - if it is categorical - is to use facets:
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
geom_point(size = 3) +
labs(title = "Scatter Plot of Weight vs. MPG Faceted by number of cylinders",
x = "Weight (1000 lbs)",
y = "Miles per Gallon",
color = "Number of cylinders") +
facet_wrap(~cyl) + # creates one facet per cylinder type
theme_classic()
Here, facet_wrap(~cyl) tells ggplot2 to
create a separate plot for each unique value of cyl. By
default, all facets share the same axis scales, which makes comparison
easier. If you want each facet to have its own scale, you can add
scales = "free" inside facet_wrap(), but
personally I think it’s very rarely worth using different scales because
it often makes the comparison between facets really difficult and thus
hides an interesting effect. For example:
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
geom_point(size = 3) +
labs(title = "Scatter Plot of Weight vs. MPG Faceted by number of cylinders",
x = "Weight (1000 lbs)",
y = "Miles per Gallon",
color = "Number of cylinders") +
facet_wrap(~cyl, scales = "free") +
theme_classic()
I think this has made the graph worse! Now it actually doesn’t tell the story we want it to tell, so there is little point in such a plot.
Number of cylinders is a categorical variable - what if we wanted to
add a continuous variable, for example
horsepower?
ggplot(mtcars, aes(x = wt, y = mpg, color = hp)) +
geom_point(size = 3) +
labs(title = "Scatter Plot of Weight vs. MPG by Horsepower",
x = "Weight (1000 lbs)",
y = "Miles per Gallon",
color = "Horsepower") +
theme_classic()
hp is a continuous variable, so ggplot2 uses a continuous colour
scale instead of distinct colours, with darker colours indicating lower
horsepower and lighter colours indicating higher horsepower. Actually,
this seems a bit unintuitive, so let’s change it so that darker colours
indicate higher values:
ggplot(mtcars, aes(x = wt, y = mpg, color = hp)) +
geom_point(size = 3) +
labs(title = "Scatter Plot of Weight vs. MPG by Horsepower",
x = "Weight (1000 lbs)",
y = "Miles per Gallon",
color = "Horsepower") +
scale_color_continuous(trans = "reverse") + # reverses the default colour gradient
theme_classic()
Line plots
Let’s now see how we can recreate the line plots from before - remember that these are useful when we want to illustrate change over time.
First, we convert the time series to a data frame for ggplot to work:
air_passengers_df <- data.frame(
Year = as.numeric(time(AirPassengers)),
Passengers = as.numeric(AirPassengers))
Now we can plot using geom_line():
ggplot(air_passengers_df, aes(x = Year, y = Passengers)) +
geom_line(color = "blue", linewidth = 1) +
labs(title = "Monthly Airline Passengers (1949–1960)",
x = "Year",
y = "Number of Passengers") +
theme_classic()
I think it would be nice if we could show each year:
ggplot(air_passengers_df, aes(x = Year, y = Passengers)) +
geom_line(color = "blue", linewidth = 1) +
labs(title = "Monthly Airline Passengers (1949–1960)",
x = "Year",
y = "Number of Passengers") +
scale_x_continuous(breaks = seq(1949, 1960, by = 1)) + # show each year
theme_classic()
If you find the years difficult to read when they sit next to one another, we can tilt the labels by 45 degrees. We can also add more detail to the y-axis:
ggplot(air_passengers_df, aes(x = Year, y = Passengers)) +
geom_line(color = "blue", linewidth = 1) +
scale_x_continuous(
breaks = seq(1949, 1960, by = 1)) + # one tick per year
scale_y_continuous(
breaks = seq(100, 600, by = 100)) + # custom y-axis breaks
labs(title = "Monthly Airline Passengers (1949–1960)",
x = "Year",
y = "Number of Passengers") +
theme_classic() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # tilt x-axis labels
Now we have enough information to see what exactly is happening each year!
Let’s do the second line plot from before on vapor pressure and temperature:
ggplot(pressure, aes(x = temperature, y = pressure)) +
geom_line(color = "red", linewidth = 1) +
labs(title = "Vapor Pressure of Mercury vs. Temperature",
x = "Temperature (°C)",
y = "Pressure (mm Hg)") +
theme_classic()
Box plots with 2 variables
We’ve looked at how to create box plots and we’ve also looked at how
to represent variables with colour. Let’s combine these strands of
knowledge to create a box plot for mpg by number of
cylinders (but note that here again the use of colour does not convey
any additional information and is therefore unnecessary):
ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
geom_boxplot() +
labs(title = "Box Plot of MPG by number of cylinders",
x = "Cylinders",
y = "Miles per Gallon",
fill = "Cylinders") +
theme_classic()
Exercise: use scale_fill_manual() to set custom fill colours for each
cylinder group.
Raincloud plots with 2 variables
Finally, let’s see how we can create a nice raincloud plot for two variables!
# now n cylinders is on the x axis
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
stat_halfeye(
adjust = 0.5,
justification = -0.3,
fill = "lightblue",
.width = 0,
point_colour = NA,
position = position_nudge(x = -0.15)) +
geom_boxplot(
width = 0.1,
fill = "lightblue",
outlier.colour = "black",
outlier.shape = 16,
outlier.size = 2,
position = position_nudge(x = -0.25)) +
geom_jitter(
width = 0.05,
alpha = 0.6,
color = "blue") +
labs(title = "Raincloud Plot of Miles per Gallon (MPG) by number of cylinders",
x = "Number of cylinders",
y = "Miles per Gallon") +
theme_classic() + coord_flip()
Interesting! Now we clearly see that the distribution of miles per gallon in cars with 4 cylinders is closer to uniform than in cars with 8 cylinders, for instance. Again, this just shows how effective raincloud plots are!
Statistical summaries in ggplot2
One of the great strengths of ggplot2 is that it allows
you to compute summary statistics on the fly and overlay them on your
plots. This is done using statistical layers, such as
stat_summary() or specialised geoms like
geom_smooth().
Using stat_summary()
stat_summary() lets you compute a function (like mean,
median, standard deviation, or standard error) for each group and
display it as a point, line, or bar. For example, we can plot the mean
MPG for each cylinder group with standard error bars like so:
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
stat_summary(
fun = mean, # compute mean
geom = "point", # display as points
color = "red",
size = 3) +
stat_summary(
fun.data = mean_se, # compute one SE of the mean
geom = "errorbar", # draws vertical error bars showing uncertainty
width = 0.1,
color = "black") +
labs(title = "Mean MPG by number of cylinders",
x = "Cylinders",
y = "Miles per Gallon") +
theme_classic()
Exercise: recreate this plot with horsepower (hp) and weight (wt) on the y-axis instead
of mpg.
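To see what stat_summary(fun.data = mean_se) works out behind the scenes, we can call mean_se() directly on one group and compare it with a calculation by hand. This is only a sketch for the 4-cylinder cars; the object names (mpg_4cyl, by_helper, by_hand) are illustrative:

```r
library(ggplot2)

# MPG values for the 4-cylinder cars only
mpg_4cyl <- mtcars$mpg[mtcars$cyl == 4]

# What mean_se() returns for this group: mean (y) and mean -/+ one SE (ymin, ymax)
by_helper <- mean_se(mpg_4cyl)

# The same quantities computed by hand
m  <- mean(mpg_4cyl)
se <- sd(mpg_4cyl) / sqrt(length(mpg_4cyl))
by_hand <- data.frame(y = m, ymin = m - se, ymax = m + se)

by_helper
by_hand
```

The error bars therefore span one standard error either side of the mean, which is narrower than a 95% confidence interval.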
Adding regression lines or smoothing
Another common use of statistical layers is modeling relationships
with geom_smooth():
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE, color = "blue") +
labs(title = "MPG vs. Weight with a regression line",
x = "Weight (1000 lbs)",
y = "Miles per Gallon") +
theme_classic()
Here, geom_smooth(method = "lm") fits a linear model to
the data, and se = TRUE adds a shaded confidence interval
around the regression line.
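The line drawn by geom_smooth(method = "lm") corresponds to an ordinary lm() fit of mpg on wt, so we can inspect the numbers behind it directly. A minimal sketch (the object names fit and new_wt are illustrative, and the three weights are arbitrary):

```r
# Fit the same simple linear regression that geom_smooth(method = "lm") draws
fit <- lm(mpg ~ wt, data = mtcars)

# Intercept and slope of the regression line
coef(fit)

# Fitted values with a 95% confidence interval at a few example weights;
# 0.95 is also the default level used by geom_smooth()
new_wt <- data.frame(wt = c(2, 3, 4))
predict(fit, newdata = new_wt, interval = "confidence", level = 0.95)
```

The slope is roughly -5.3, meaning that each additional 1000 lbs of weight is associated with about 5.3 fewer miles per gallon.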
Instead of a linear regression (method = "lm"), you can
use a loess smoother (method = "loess") to capture more
complex relationships:
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
geom_smooth(method = "loess", se = TRUE, color = "blue", fill = "lightblue") + # light blue shading instead of default grey
labs(title = "MPG vs. Weight",
x = "Weight (1000 lbs)",
y = "Miles per Gallon") +
theme_classic()
It looks like for these two variables, a straight line is a good fit, but a loess line is useful when you suspect that the relationship is not strictly linear.
Exercise: explore the relationship between horsepower (hp) and weight (wt) using
geom_smooth().
Combining plots
So far, we’ve created individual plots, but sometimes we want to
display several plots together in a single figure - for example, to
compare different variables or visualisations side by side. This can be
done easily using the ggarrange() function from the
ggpubr package.
The first step is to create the individual plots that you want to combine into a single figure and store them as separate objects in your environment. For example, if we are analysing how cars’ fuel efficiency relates to their weight and horsepower, we might want to create a figure with two panels: one showing the relationship between fuel efficiency and weight, and the other showing the relationship between fuel efficiency and horsepower.
We already have the code for the first plot, so all we need to do now is store it as an object in our environment, like this (note that I’ve removed the title because, in academic publications, we typically don’t include titles in figures):
mpg_vs_wt_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
geom_smooth(method = "loess", se = TRUE, color = "blue", fill = "lightblue") +
labs(x = "Weight (1000 lbs)",
y = "Miles per Gallon") +
theme_classic()
The next step is to create the second plot:
mpg_vs_hp_plot <- ggplot(mtcars, aes(x = hp, y = mpg)) +
geom_point() +
geom_smooth(method = "loess", se = TRUE, color = "blue", fill = "lightblue") +
labs(x = "Horsepower",
y = "Miles per Gallon") +
scale_x_continuous(breaks = seq(0, max(mtcars$hp), by = 50)) + # breaks every 50 units
theme_classic()
mpg_vs_hp_plot
Now we are ready to combine the two plots into one:
library("ggpubr")
fig1 <- ggarrange(
mpg_vs_wt_plot, mpg_vs_hp_plot, # plots you want to combine
ncol = 2, nrow = 1, # 2 columns, 1 row (side by side)
labels = c("A", "B")) # optional labels for subplots
fig1
The ggarrange() function offers several arguments that
can help make your combined figures look cleaner and more professional.
For example, if the plots you’re combining share the same legend, you
can specify a common legend for both plots. This helps reduce repetition
and save space, and makes the figure easier to read:
# mpg vs wt, with different colours for different n cylinders
# (plotted above already)
mpg_wt_cyl_plot <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
geom_point(size = 3) +
labs(x = "Weight (1000 lbs)",
y = "Miles per Gallon",
color = "Number of cylinders") +
theme_classic()
# mpg vs hp, with different colours for different n cylinders
mpg_hp_cyl_plot <- ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
geom_point(size = 3) +
labs(x = "Horsepower",
y = "Miles per Gallon",
color = "Number of cylinders") +
theme_classic()
fig2 <- ggarrange(
mpg_wt_cyl_plot, mpg_hp_cyl_plot,
ncol = 2, nrow = 1,
labels = c("A", "B"))
fig2
And now combine with a common legend:
fig3 <- ggarrange(
mpg_wt_cyl_plot, mpg_hp_cyl_plot,
ncol = 2, nrow = 1,
labels = c("A", "B"),
common.legend = TRUE) # use common legend
fig3
As you can see, the legend is currently placed at the top of the plot, which I think looks nice. However, we could also move it to the bottom if we prefer:
fig4 <- ggarrange(
mpg_wt_cyl_plot, mpg_hp_cyl_plot,
ncol = 2, nrow = 1,
labels = c("A", "B"),
common.legend = TRUE,
legend = "bottom") # change placement of legend
fig4
Exercise: combine plots of your choice using ggarrange(). Experiment with the layout by arranging the
plots side by side or stacked vertically, and explore other arguments to
make the combined figure visually appealing.
Exporting plots
A key consideration when creating figures is how to export them and ensure that the exported version maintains high quality. R provides several ways to export plots, and the choice depends on your needs, such as the desired file format, resolution, and whether you want a vector or raster image. Raster images are made of pixels and can become blurry when enlarged, while vector images are made of shapes and lines and can be scaled to any size without losing quality.
Raster formats:
PNG: good for web use and general-purpose images, supports transparency;
JPEG/JPG: suitable for photographs, but uses lossy compression to reduce the file size, which can reduce quality;
TIFF: high-quality, lossless format often used in publications.
Vector formats:
PDF: ideal for figures in publications or reports; scalable without losing quality;
SVG: scalable vector graphics; great for web graphics and figures you may want to edit later;
EPS: traditional vector format used in journals.
When submitting a paper to a journal, you can usually choose the format of your figures; however, once your paper is accepted, you will often be asked to provide high-quality versions, typically vector images or high-resolution raster files, so it’s important to understand what they are and how to create them.
You may have noticed that the Plots pane in R provides a clickable option to export the image currently displayed:
You can use this option; however, this is not ideal for high-quality
or publication-ready figures, as it often produces raster images with
limited resolution and less control over dimensions and formatting. The
best option is to use the ggsave() function, which supports
multiple formats and allows precise control over dimensions and
resolution:
# save as a png
ggsave("fig4.png", plot = fig4, width = 8, height = 6, dpi = 300,
bg = "transparent")
# save as a pdf
ggsave("fig4.pdf", plot = fig4, width = 8, height = 6)
For raster images, set the dpi to 300 or higher to ensure high-quality output. To determine the figure size, consider the intended use: for a paper, follow the journal’s author guidelines, which typically specify figure width (in inches or centimeters); for presentation slides (e.g., a 16:9 aspect ratio), a figure around 8–10 inches wide by 5–6 inches tall usually works well.
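With ggsave(), width and height are interpreted in inches by default,
but the units argument lets you switch to centimetres, millimetres, or
pixels. For raster images, it’s often convenient to think in pixels,
while for vector outputs, specifying dimensions in inches or centimetres is more
natural.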
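As a rough sketch of how the units argument can be used, the example below saves a simple scatter plot once in pixels and once in centimetres. It assumes a reasonably recent version of ggplot2 (pixel units are not available in older releases); the file names and sizes are purely illustrative, and the files are written to a temporary folder so nothing in your project is overwritten:

```r
library(ggplot2)

# A simple plot to export (illustrative)
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_classic()

# Write to a temporary folder; replace with your own paths as needed
png_path <- file.path(tempdir(), "scatter_px.png")
pdf_path <- file.path(tempdir(), "scatter_cm.pdf")

# Raster output: dimensions given in pixels
ggsave(png_path, plot = p, width = 2400, height = 1800, units = "px", dpi = 300)

# Vector output: dimensions given in centimetres
ggsave(pdf_path, plot = p, width = 17, height = 12, units = "cm")
```

At 300 dpi, 2400 by 1800 pixels corresponds to an 8 by 6 inch figure, so the PNG matches the size used in the earlier ggsave() call.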
A note on colours
So far, we’ve relied on the default colours in base R or
ggplot2 (with a few exceptions). While these are fine for
quick plots, you’ll often want to customise colours to make your figures
clearer, more appealing, or suitable for publication. There are many
ways to define and modify colours in R, depending on what you need.
R recognises colour names (like “orange”, “skyblue”, “darkgreen”) and
also accepts hexadecimal colour codes, such as “#E69F00” (a shade of
orange) or “#56B4E9” (light blue). You can find a full list of named
colours by typing colors() in R, or explore ready-made palettes on
dedicated colour-palette websites.
When choosing colours, it’s important to think about
accessibility and contrast —
especially if your plots might be printed in black and white, or viewed
by people with colour vision deficiencies. Packages like
RColorBrewer and viridis provide
colourblind-friendly palettes that look great in both colour and
grayscale.
Changing colours for a categorical variable
Recall the scatter plot we created earlier where ggplot2
automatically assigned discrete colours to different numbers of
cylinders. We can manually define our own using
scale_color_manual() (and the companion function for filled
plots is scale_fill_manual()):
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
geom_point(size = 3) +
labs(title = "Scatter Plot of Weight vs. MPG by cylinder count",
x = "Weight (1000 lbs)",
y = "Miles per Gallon",
color = "Number of cylinders") +
scale_color_manual(values = c("4" = "#56B4E9", # blue
"6" = "#E69F00", # orange
"8" = "#009E73")) + # green
theme_classic()
Here we’ve manually mapped each cylinder group to a specific colour
using hex codes. The scale_color_manual() function is used
when your variable is discrete.
Changing colours for a continuous variable
When we mapped horsepower (hp) earlier, we used a
continuous colour scale. For these types of variables, you can use
scale_color_gradient() for more control:
ggplot(mtcars, aes(x = wt, y = mpg, color = hp)) +
geom_point(size = 3) +
labs(title = "Scatter Plot of Weight vs. MPG by Horsepower",
x = "Weight (1000 lbs)",
y = "Miles per Gallon",
color = "Horsepower") +
scale_color_gradient(low = "lightblue", high = "darkblue") +
theme_classic()
This maps lower horsepower values to lighter blue and higher values to darker blue.
Using colourblind-friendly palettes
If you’re preparing figures for a publication or presentation, it’s a
good idea to use palettes that are easy to interpret by everyone. One of
the most popular and accessible options comes from the
viridis package:
library("viridis")
ggplot(mtcars, aes(x = wt, y = mpg, color = hp)) +
geom_point(size = 3) +
scale_color_viridis(option = "plasma") +
labs(title = "Scatter plot of Weight vs. MPG",
x = "Weight (1000 lbs)",
y = "Miles per Gallon",
color = "Horsepower") +
theme_classic()
The viridis palette ensures consistent contrast across
the scale and remains readable in grayscale or by people with colour
vision deficiencies.
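The viridis palettes also work for categorical variables. As a sketch, ggplot2 ships its own discrete version, scale_color_viridis_d(), so the cylinder scatter plot can be recoloured without loading an extra package. The end = 0.9 setting is just a personal choice to avoid the very pale yellow at the top of the scale, and p_viridis is an illustrative name:

```r
library(ggplot2)

# Discrete viridis colours for the three cylinder groups
p_viridis <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  scale_color_viridis_d(option = "viridis", end = 0.9) +
  labs(x = "Weight (1000 lbs)",
       y = "Miles per Gallon",
       color = "Number of cylinders") +
  theme_classic()

p_viridis
```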
So lots of options for colour! The most important thing is to remember that whatever colour you use, it should add information, not confusion!
Advanced plots
While ggplot2 excels at standard plots (scatter, bar,
line, etc.), you can also create more specialised visualisations, often
with the help of additional packages.
Heatmaps
A heatmap displays values in a matrix as coloured tiles, making it ideal for visualising correlations or any grid-based data.
For example, if we want to examine how each variable in the
mtcars dataset correlates with the others, we can compute a
correlation matrix and explore it:
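The matrix shown below could be produced along these lines. This is a sketch: the object name cor_mat matches the one used in the plotting code further down, and rounding to two decimals is inferred from the printed values.

```r
# Pairwise Pearson correlations between all variables in mtcars,
# rounded to two decimals for readability
cor_mat <- round(cor(mtcars), 2)
cor_mat
```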
## mpg cyl disp hp drat wt qsec vs am gear carb
## mpg 1.00 -0.85 -0.85 -0.78 0.68 -0.87 0.42 0.66 0.60 0.48 -0.55
## cyl -0.85 1.00 0.90 0.83 -0.70 0.78 -0.59 -0.81 -0.52 -0.49 0.53
## disp -0.85 0.90 1.00 0.79 -0.71 0.89 -0.43 -0.71 -0.59 -0.56 0.39
## hp -0.78 0.83 0.79 1.00 -0.45 0.66 -0.71 -0.72 -0.24 -0.13 0.75
## drat 0.68 -0.70 -0.71 -0.45 1.00 -0.71 0.09 0.44 0.71 0.70 -0.09
## wt -0.87 0.78 0.89 0.66 -0.71 1.00 -0.17 -0.55 -0.69 -0.58 0.43
## qsec 0.42 -0.59 -0.43 -0.71 0.09 -0.17 1.00 0.74 -0.23 -0.21 -0.66
## vs 0.66 -0.81 -0.71 -0.72 0.44 -0.55 0.74 1.00 0.17 0.21 -0.57
## am 0.60 -0.52 -0.59 -0.24 0.71 -0.69 -0.23 0.17 1.00 0.79 0.06
## gear 0.48 -0.49 -0.56 -0.13 0.70 -0.58 -0.21 0.21 0.79 1.00 0.27
## carb -0.55 0.53 0.39 0.75 -0.09 0.43 -0.66 -0.57 0.06 0.27 1.00
However, this large table of numbers can be hard to interpret, so it’s more useful to visualise it as a heatmap:
# convert to long format using pivot_longer
cor_mat_long <- as.data.frame(cor_mat) %>%
mutate(Var1 = rownames(.)) %>%
pivot_longer(-Var1, names_to = "Var2", values_to = "Cor_coef")
head(cor_mat_long)
## # A tibble: 6 x 3
## Var1 Var2 Cor_coef
## <chr> <chr> <dbl>
## 1 mpg mpg 1
## 2 mpg cyl -0.85
## 3 mpg disp -0.85
## 4 mpg hp -0.78
## 5 mpg drat 0.68
## 6 mpg wt -0.87
# plot
ggplot(cor_mat_long, aes(x = Var1, y = Var2, fill = Cor_coef)) +
geom_tile() +
scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
theme_minimal() +
labs(title = "Correlation Heatmap")
Ridgeline plots
Ridgeline plots are a great way to visualise the distribution of a
continuous variable across multiple groups. They combine the idea of
density plots with a stacked layout, making it easy to compare
distributions across categories. The ggridges package
provides a simple way to create these plots in ggplot2.
For example, in the mtcars dataset, we can visualise how
mpg is distributed for cars with different numbers of
cylinders:
library("ggridges")
ggplot(mtcars, aes(x = mpg, y = factor(cyl), fill = factor(cyl))) +
geom_density_ridges(alpha = 0.7) +
theme_classic() +
labs(
title = "MPG distribution by cylinder count",
y = "Number of cylinders",
x = "Miles per Gallon",
fill = "Number of cylinders")
Sankey / alluvial diagrams
Sankey or alluvial diagrams are excellent for visualising flows between categories or the distribution of cases across multiple categorical variables. They show how observations move from one category to another, with the width of the flows proportional to counts or weights.
In R, the ggalluvial package integrates seamlessly with
ggplot2 to create these diagrams. For example, we can use
the Titanic dataset used in the Intro
to Quantitative Analysis in R session to visualise how passengers
were distributed across class and survival status:
library("ggalluvial")
# convert Titanic dataset to data frame
data <- as.data.frame(Titanic)
# Create Sankey / alluvial plot
ggplot(data = data,
aes(axis1 = Class, axis2 = Survived, y = Freq)) +
geom_alluvium(aes(fill = Survived)) +
geom_stratum() +
geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
theme_classic() +
theme(
axis.text.x = element_blank(), # remove x-axis labels
axis.ticks.x = element_blank()) + # remove x-axis ticks
labs(title = "Titanic Survival Sankey Diagram",
y = "Number of passengers",
x = "Passenger Class / Survival")
Here,
axis1 and axis2 define the categorical axes (passenger class and survival status);
y = Freq specifies the size of each flow based on passenger counts;
geom_alluvium() draws the flows connecting categories;
geom_stratum() creates blocks for each category;
geom_text(stat = "stratum") labels the blocks with category names;
fill = Survived colours the flows according to survival status.
Alluvial diagrams are particularly useful for tracking transitions, showing how a population splits across categories, or highlighting patterns in multistage processes. They can also be extended to more than two axes if needed.
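As a sketch of the "more than two axes" case, we can add passenger sex as a middle axis by supplying axis3. The choice of Sex as the extra variable and the object names (titanic_df, p_alluvial) are illustrative:

```r
library(ggplot2)
library(ggalluvial)

# Titanic counts as a data frame with columns Class, Sex, Age, Survived, Freq
titanic_df <- as.data.frame(Titanic)

# Three categorical axes: class -> sex -> survival
p_alluvial <- ggplot(titanic_df,
                     aes(axis1 = Class, axis2 = Sex, axis3 = Survived, y = Freq)) +
  geom_alluvium(aes(fill = Survived)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  scale_x_discrete(limits = c("Class", "Sex", "Survived")) + # label the three axes
  labs(y = "Number of passengers") +
  theme_classic()

p_alluvial
```

Each additional axis makes the diagram busier, so it is usually worth stopping at three or four.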
3D plots
You can also create 3D plots in R to visualise surfaces or relationships among three variables. However, keep in mind that 3D plots often add complexity and can make it harder for viewers to interpret the data compared to well-designed 2D plots.
Also note that ggplot2 is primarily a 2D plotting
system, so creating real 3D plots requires other packages such as
plotly, rgl, or base R functions like
persp().
Here’s an example using base R’s persp() function to
show the surface of a volcano:
# 3D surface of the volcano dataset
volcano_matrix <- volcano
x <- 1:nrow(volcano_matrix)
y <- 1:ncol(volcano_matrix)
persp(x, y, volcano_matrix,
theta = 30, phi = 30,
col = "lightblue",
shade = 0.5,
xlab = "X", ylab = "Y", zlab = "Height")
And here is an interactive 3D plot using the plotly
package, which allows you to rotate, zoom, and explore the surface
dynamically:
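The interactive widget itself cannot be shown in static text, but a minimal sketch of how such a surface could be created with plotly looks like this (the object name volcano_plot is illustrative, and the original figure may have used different settings):

```r
library(plotly)

# Interactive 3D surface of the built-in volcano elevation matrix
volcano_plot <- plot_ly(z = ~volcano) %>%
  add_surface()

volcano_plot
```

Running this in RStudio opens the plot in the Viewer pane, where you can drag to rotate and scroll to zoom.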
Further reading and resources
Everything we’ve covered so far is just a small sample - there’s virtually no limit to what you can do with plots in R!
Use the resources provided in this file to explore other types of visualisations and discover the full range of possibilities. In addition, the following websites are great starting points:
R Graph Gallery: https://r-graph-gallery.com/
Intro to ggplot2: https://ggplot2.tidyverse.org/articles/ggplot2.html
Basic plotting with ggplot2: https://bookdown.org/rdpeng/RProgDA/basic-plotting-with-ggplot2.html
Data visualisation with ggplot2 - Cheat Sheet: https://rstudio.github.io/cheatsheets/html/data-visualization.html
R for Data Science (R4DS) - Visualisation Chapter: https://r4ds.had.co.nz/data-visualisation.html
List of resources for visualisations in R: https://github.com/erikgahner/awesome-ggplot2